Credit Card Users Churn Prediction

Problem Statement

Objective

Data Dictionary

BankChurners.csv - raw dataset of the project

Import necessary packages

Load the dataset

Data Structure and Overview

Shape of the data

Check the data column types

Drop CLIENTNUM column

Convert object to categorical types

Check missing values

Check duplicate values

Check statistical summary of dataset for numerical columns

Check statistical summary of dataset for non-numerical columns

Check unique values for categorical columns

Fix Income_Category

Univariate Data Analysis

Observation on Customer_Age

Observation on Dependent_count

Observation on Months_on_book

Observation on Total_Relationship_Count

Observation on Months_Inactive_12_mon

Observation on Contacts_Count_12_mon

Observation on Credit_Limit

Observation on Total_Revolving_Bal

Observation on Avg_Open_To_Buy

Observation on Total_Amt_Chng_Q4_Q1

Observation on Total_Trans_Amt

Observation on Total_Trans_Ct

Observation on Total_Ct_Chng_Q4_Q1

Observation on Avg_Utilization_Ratio

Observations on non-numerical variables

Observation on Attrition_Flag

Observation on Gender

Observation on Education_Level

Observation on Marital_Status

Observation on Income_Category

Observation on Card_Category

Bivariate Data Analysis

Attrition_Flag vs Customer_Age

Attrition_Flag vs Gender

Attrition_Flag vs Dependent_count

Attrition_Flag vs Education_Level

Attrition_Flag vs Marital_Status

Attrition_Flag vs Income_Category

Attrition_Flag vs Card_Category

Attrition_Flag vs Months_on_book

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Months_Inactive_12_mon

Attrition_Flag vs Contacts_Count_12_mon

Attrition_Flag vs Credit_Limit

Attrition_Flag vs Total_Revolving_Bal

Attrition_Flag vs Avg_Open_To_Buy

Attrition_Flag vs Total_Amt_Chng_Q4_Q1

Attrition_Flag vs Total_Trans_Amt

Attrition_Flag vs Total_Trans_Ct

Attrition_Flag vs Total_Ct_Chng_Q4_Q1

Attrition_Flag vs Avg_Utilization_Ratio

Summary of EDA

Data Description

Univariate Data Analysis

Bivariate Data Analysis

Data Pre-Processing

Summary of Data Pre-processing

Outlier Treatment

Data Preparation for Modeling

Define dependent variable

Split data into training, validation and testing set

Missing values treatment

Create dummy variables

Summary of Data Preparation

Split data into training, validation and testing set

Missing values treatment

Create dummy variables

Building the model

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting a customer will churn, but they actually will not - Loss of resources
  2. Predicting a customer will not churn, but they actually will - Loss of opportunities

Which case is more important?

How to reduce this loss, i.e. how to reduce False Negatives?

Create functions to calculate different metrics and confusion matrix
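
As an illustration of the evaluation criterion above, here is a minimal sketch of such helper functions in plain Python. It is not the notebook's own implementation: the names `confusion_counts` and `classification_metrics` are hypothetical, and the encoding of the churned class as `1` is an assumption. Because a false negative is a customer who churns without being flagged, recall on the churn class is the metric that falls as false negatives rise.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN, treating `positive` as the churn class.

    Assumption: churned customers are encoded as 1, all others as 0.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision and F1 as a dictionary."""
    c = confusion_counts(y_true, y_pred, positive)
    total = sum(c.values())
    actual_pos = c["tp"] + c["fn"]      # customers who really churned
    predicted_pos = c["tp"] + c["fp"]   # customers flagged as churners
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    accuracy = (c["tp"] + c["tn"]) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Toy labels for illustration only (not taken from BankChurners.csv).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts)
print(metrics)
```

In this toy example one churner is missed (one false negative), so recall is 2/3 even though accuracy is 0.75, which shows why accuracy alone can hide the costlier error.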

Model 1 - Logistic Regression with default parameters

Check performance on training set

Check performance on validation set

Model 2 - Decision Tree with default parameters

Check performance on training set

Check performance on validation set

Model 3 - Bagging with default parameters

Check performance on training set

Check performance on validation set

Model 4 - Boosting method with AdaBoost Classifier (default parameters)

Check performance on training set

Check performance on validation set

Model 5 - Boosting method with Gradient Boosting Classifier (default parameters)

Check performance on training set

Check performance on validation set

Model 6 - Boosting method with XGBoost Classifier (default parameters)

Check performance on training set

Check performance on validation set

Summary of model building with default hyperparameters

Models for training data

Models for validation data

Logistic Regression

Decision Tree

Bagging classifier

AdaBoost classifier

Gradient Boost classifier

XGBoost classifier

------------------ Overall ------------------

Model Building - Using KFold and cross_val_score

Model Building - Oversampling data

Oversampling train data using SMOTE

Model 1 - Logistic Regression with oversampled data

Check performance on training set

Check performance on validation set

Model 2 - Decision Tree with oversampled data

Check performance on training set

Check performance on validation set

Model 3 - Bagging with oversampled data

Check performance on training set

Check performance on validation set

Model 4 - AdaBoost with oversampled data

Check performance on training set

Check performance on validation set

Model 5 - Gradient Boost with oversampled data

Check performance on training set

Check performance on validation set

Model 6 - XGBoost with oversampled data

Check performance on training set

Check performance on validation set

Summary of model building with oversampled data

Models for training set

Models for validation set

Logistic Regression with oversampled data

Decision Tree with oversampled data

Bagging with oversampled data

AdaBoost with oversampled data

Gradient Boost with oversampled data

XGBoost with oversampled data

--------- Overall ---------

Model Building - Undersampling data

Undersampling using Random Undersampler
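
The idea behind random undersampling can be sketched in plain Python as follows. This is an illustration of the technique only, not the library routine the notebook calls: the function name `random_undersample`, the toy data, and the choice to shrink every class to the size of the smallest one are assumptions. Only the training data would normally be resampled, so the validation and testing sets keep the original class balance.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=1):
    """Randomly drop rows so every class has as many rows as the smallest class.

    Hypothetical helper for illustration; `rows` and `labels` are parallel lists.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    # Every class is reduced to the minority class size.
    n_minority = min(len(indices) for indices in indices_by_class.values())

    kept = []
    for indices in indices_by_class.values():
        kept.extend(rng.sample(indices, n_minority))
    kept.sort()  # preserve the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy training data: six existing customers (0) and two churned customers (1).
rows = [f"customer_{i}" for i in range(8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

X_under, y_under = random_undersample(rows, labels)
print(Counter(labels))   # class counts before resampling
print(Counter(y_under))  # class counts after resampling
```

The trade-off of this approach is that majority-class rows are discarded, so the models in this section train on less data than those built on the original or oversampled training set.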

Model 1 - Logistic Regression with undersampled data

Check performance on training set

Check performance on validation set

Model 2 - Decision Tree with undersampled data

Check performance on training set

Check performance on validation set

Model 3 - Bagging with undersampled data

Check performance on training set

Check performance on validation set

Model 4 - AdaBoost with undersampled data

Check performance on training set

Check performance on validation set

Model 5 - Gradient Boost with undersampled data

Check performance on training set

Check performance on validation set

Model 6 - XGBoost with undersampled data

Check performance on training set

Check performance on validation set

Summary of model building with undersampled data

Models for training set

Models for validation set

Logistic Regression with undersampled data

Decision Tree with undersampled data

Bagging with undersampled data

AdaBoost with undersampled data

Gradient Boost with undersampled data

XGBoost with undersampled data

---------- Overall ----------

Hyperparameter tuning using RandomizedSearchCV

Model 1 - Gradient Boost from default hyperparameters

Check performance on training set

Check performance on validation set

Model 2 - Gradient Boost from oversampling

Check performance on training set

Check performance on validation set

Model 3 - AdaBoost from undersampling

Check performance on training set

Check performance on validation set

Summary of model building using RandomizedSearchCV

Models for training set

Models for validation set

Gradient Boost from default hyperparameters

Gradient Boost from oversampling

AdaBoost from undersampling

--------- Overall ---------

Check feature importance on best model

Pipelines for productionizing the model

Column Transformer

Numerical columns

Categorical columns

Combine numeric and categorical columns

Split data into training and testing set

Model building

Check performance on training set

Check performance on testing set

Conclusion

Business Recommendations